Merge our internal changes and fix conflicts for DF branch 50 (#13)
Conversation
* Initial commit
* Fix formatting
* Add across partitions check
* Add a new test case
* Fix buggy test
…#13909) (apache#13934)
* Set utf8view as return type when input type is the same
* Verify that the returned type from call to scalar function matches the return type specified in the return_type function
* Match return type to utf8view

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
This reverts commit 5383d30.
* fix: fetch is missed in the EnforceSorting
* fix conflict
* resolve comments from alamb
* update
…it disabled by default
…e#14415) (apache#14453)
* chore: Fixed CI
* chore
* chore: Fixed clippy
* chore

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Test for string / numeric coercion
* fix tests
* Update tests
* Add tests to stringview
* add numeric coercion
```diff
 /// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`,
 /// and `Binary/BinaryLarge` with `BinaryView`.
-pub schema_force_view_types: bool, default = true
+pub schema_force_view_types: bool, default = false
```
Default to utf8.
```rust
// Whether to allow truncated rows when parsing.
// By default this is set to false and will error if the CSV rows have different lengths.
// When set to true then it will allow records with less than the expected number of columns
pub truncated_rows: Option<bool>, default = None
```
Our CSV truncated_rows support will be included in DF 51.0.0, but not in DF 50.0.0.
apache#17465
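The truncated-rows behavior the new option enables can be sketched as follows. This is a minimal illustration, not DataFusion's reader code: the helper name `pad_truncated_row` and the `Option<String>`-per-cell representation are assumptions for the sketch.

```rust
// Sketch: a record with fewer fields than the schema expects is padded with
// nulls (None); a record with *more* fields is still an error, matching the
// documented behavior. Illustrative only, not the actual CSV reader.
fn pad_truncated_row(fields: &[&str], expected: usize) -> Option<Vec<Option<String>>> {
    if fields.len() > expected {
        return None; // too many columns remains an error
    }
    let mut row: Vec<Option<String>> =
        fields.iter().map(|f| Some(f.to_string())).collect();
    row.resize(expected, None); // fill missing trailing columns with nulls
    Some(row)
}

fn main() {
    // "a,b" against a 3-column schema: the third column becomes null.
    let row = pad_truncated_row(&["a", "b"], 3).unwrap();
    assert_eq!(row, vec![Some("a".to_string()), Some("b".to_string()), None]);
}
```

Note that, as the doc comment says, a non-nullable schema would still reject the padded nulls downstream.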
```rust
    DecoderDeserializer::new(CsvDecoder::new(decoder))
}

fn csv_deserializer_with_truncated(
```
Our CSV truncated_rows support will be included in DF 51.0.0, but not in DF 50.0.0.
apache#17465
```rust
/// By default this is set to false and will error if the CSV rows have different lengths.
/// When set to true then it will allow records with less than the expected number of columns and fill the missing columns with nulls.
/// If the record's schema is not nullable, then it will still return an error.
pub truncated_rows: bool,
```
Same as above.
```diff
 assert_eq!(
     string_truncation_stats.max_value,
-    Precision::Inexact(ScalarValue::Utf8View(Some("b".repeat(63) + "c")))
+    Precision::Inexact(Utf8(Some("b".repeat(63) + "c")))
```
We default to utf8.
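For context on why the expected value is `"b".repeat(63) + "c"` rather than all `b`s: when a max statistic is truncated to a length budget, the last kept byte is incremented so the shortened string stays an upper bound, which is also why the precision is `Inexact`. The helper below is an illustrative sketch under that assumption, not DataFusion's statistics code, and it ignores the edge case where the last byte is already at its maximum.

```rust
// Sketch: truncate a max statistic to `max_len` bytes while keeping it an
// upper bound by bumping the last retained byte ('b' -> 'c').
// Illustrative only; assumes ASCII input and a non-0xFF last byte.
fn truncate_max_stat(value: &str, max_len: usize) -> String {
    if value.len() <= max_len {
        return value.to_string();
    }
    let mut bytes = value.as_bytes()[..max_len].to_vec();
    let last = bytes.last_mut().expect("max_len must be > 0");
    *last += 1; // keep truncated value >= every real value
    String::from_utf8(bytes).expect("ASCII input stays valid UTF-8")
}

fn main() {
    // A 100-character max value of all 'b's, truncated to a 64-byte budget:
    let truncated = truncate_max_stat(&"b".repeat(100), 64);
    assert_eq!(truncated, "b".repeat(63) + "c");
}
```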
```diff
-    .await
+    .await?;
+let mut id_annotator = NodeIdAnnotator::new();
+annotate_node_id_for_execution_plan(&physical_plan, &mut id_annotator)
```
Our internal node_id support.
```diff
-let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8view("01/01/09")));
+// xudong: use new_utf8, because schema_force_view_types was changed to false now.
+// qi: when schema_force_view_types setting to true, we should change back to utf8view
+let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8("01/01/09")));
```
Default to utf8.
```rust
    self.sink.metrics()
}

fn with_node_id(
```
Internal node_id support.
```rust
    }
}

fn with_node_id(
```
Internal node_id support.
```rust
/// Infers new predicates by substituting equalities.
/// For example, with predicates `t2.b = 3` and `t1.b > t2.b`,
/// we can infer `t1.b > 3`.
fn infer_predicates_from_equalities(predicates: Vec<Expr>) -> Result<Vec<Expr>> {
```
Related to:
apache#15906
Should we reopen our upstream PR?
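The substitution the doc comment describes can be sketched with a toy predicate type. This is not the upstream implementation from apache#15906; the `Pred` enum and `infer_predicates` name are invented for illustration, and real expressions would of course cover more operators than `=` and `>`.

```rust
// Sketch: given `t2.b = 3` and `t1.b > t2.b`, substituting the constant
// bound by the equality yields the inferred predicate `t1.b > 3`.
#[derive(Clone, Debug, PartialEq)]
enum Pred {
    EqConst(String, i64),  // col = constant
    GtCol(String, String), // left_col > right_col
    GtConst(String, i64),  // col > constant
}

fn infer_predicates(preds: &[Pred]) -> Vec<Pred> {
    let mut inferred = Vec::new();
    for p in preds {
        if let Pred::GtCol(l, r) = p {
            // Find an equality that binds the right-hand column to a constant.
            for q in preds {
                if let Pred::EqConst(col, val) = q {
                    if col == r {
                        inferred.push(Pred::GtConst(l.clone(), *val));
                    }
                }
            }
        }
    }
    inferred
}

fn main() {
    let preds = vec![
        Pred::EqConst("t2.b".to_string(), 3),
        Pred::GtCol("t1.b".to_string(), "t2.b".to_string()),
    ];
    assert_eq!(
        infer_predicates(&preds),
        vec![Pred::GtConst("t1.b".to_string(), 3)]
    );
}
```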
```diff
 }

-context.update_plan_from_children()
+Ok((context.update_plan_from_children()?, fetch))
```
Related to our internal fetch support for enforce_distribution.
```rust
        data,
        children,
    },
    mut fetch,
```
Related to our internal fetch support for enforce_distribution.
```rust
// If `fetch` was not consumed, it means that there was `SortPreservingMergeExec` with fetch before.
// It was removed by `remove_dist_changing_operators`
// and we need to add it back.
if fetch.is_some() {
```
Related to our internal fetch support for enforce_distribution.
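The fetch bookkeeping in the hunks above can be sketched with simplified plan types. These enums and function names are invented for the sketch and are not DataFusion's: the point is only that removing a distribution-changing operator hands its `fetch` upward, and an unconsumed `fetch` must be re-attached as a limit so rows are not over-produced.

```rust
// Sketch: `remove_merge` strips a SortPreservingMerge-like node and returns
// its fetch alongside the remaining plan; the caller re-adds a limit when
// the fetch was not consumed by any later rewrite. Illustrative types only.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan,
    SortPreservingMerge { fetch: Option<usize>, child: Box<Plan> },
    Limit { fetch: usize, child: Box<Plan> },
}

fn remove_merge(plan: Plan) -> (Plan, Option<usize>) {
    match plan {
        Plan::SortPreservingMerge { fetch, child } => (*child, fetch),
        other => (other, None),
    }
}

fn main() {
    let plan = Plan::SortPreservingMerge { fetch: Some(10), child: Box::new(Plan::Scan) };
    let (plan, fetch) = remove_merge(plan);
    // The fetch was not consumed during the rewrite, so add it back on top.
    let plan = match fetch {
        Some(n) => Plan::Limit { fetch: n, child: Box::new(plan) },
        None => plan,
    };
    assert_eq!(plan, Plan::Limit { fetch: 10, child: Box::new(Plan::Scan) });
}
```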
```rust
use datafusion_common::DataFusionError;

// Util for traversing ExecutionPlan tree and annotating node_id
pub struct NodeIdAnnotator {
```
Our internal support for node_id.
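The annotator's job can be sketched on a simplified tree (not DataFusion's `ExecutionPlan`; the `Node` struct here and the field names are assumptions): a counter is threaded through a pre-order walk so every node gets a unique, stable id.

```rust
// Sketch of node_id annotation: pre-order traversal assigning sequential ids.
// Simplified tree type for illustration only.
struct Node {
    id: Option<usize>,
    children: Vec<Node>,
}

#[derive(Default)]
struct NodeIdAnnotator {
    next_id: usize,
}

impl NodeIdAnnotator {
    fn annotate(&mut self, node: &mut Node) {
        node.id = Some(self.next_id); // parent gets its id before its children
        self.next_id += 1;
        for child in &mut node.children {
            self.annotate(child);
        }
    }
}

fn main() {
    let mut plan = Node {
        id: None,
        children: vec![
            Node { id: None, children: vec![] },
            Node { id: None, children: vec![] },
        ],
    };
    NodeIdAnnotator::default().annotate(&mut plan);
    assert_eq!(plan.id, Some(0));
    assert_eq!(plan.children[0].id, Some(1));
    assert_eq!(plan.children[1].id, Some(2));
}
```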
```rust
/// Execution plan for values list based relation (produces constant rows)
#[deprecated(
    since = "45.0.0",
    note = "Use `MemorySourceConfig::try_new_as_values` instead"
```
It seems we still use the deprecated API, so I can try to upgrade those cases in a follow-up PR.
```diff
 fn remove_dist_changing_operators(
     mut distribution_context: DistributionContext,
-) -> Result<DistributionContext> {
+) -> Result<(
```
Internal fetch support.
Merge our internal changes and fix conflicts for DF branch 50.